Optimally Combining Positive and Negative Features for Text Categorization
Abstract
This paper presents a novel local feature selection approach for text categorization. It constructs a feature set for each category by first selecting a set of terms highly indicative of membership and another set of terms highly indicative of non-membership, then taking the union of the two sets. The size ratio of the two sets is chosen empirically to obtain optimal performance. This contrasts with standard local feature selection approaches, which either (1) select only the terms most indicative of membership, or (2) implicitly, but not optimally, combine the terms most indicative of membership with those most indicative of non-membership. The proposed approach is compared experimentally with the standard approaches using four feature selection metrics: chi-square, correlation coefficient, odds ratio, and GSS coefficient. The results show that the proposed approach improves text categorization performance.
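The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes documents are represented as sets of terms, uses the signed correlation coefficient (one of the four metrics named above) as the scoring function, and takes the share of positive terms as a plain parameter. The names `term_scores`, `select_features`, `size`, and `pos_ratio` are hypothetical and do not come from the paper.

```python
import math
from typing import Dict, List, Set, Tuple


def correlation_coefficient(a: int, b: int, c: int, d: int) -> float:
    """Signed correlation coefficient of a term with a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    Positive values indicate membership, negative values non-membership.
    """
    n = a + b + c + d
    denom = math.sqrt((a + c) * (b + d) * (a + b) * (c + d))
    if denom == 0:
        return 0.0
    return math.sqrt(n) * (a * d - c * b) / denom


def term_scores(docs: List[Set[str]], labels: List[bool]) -> Dict[str, float]:
    """Score every term in the vocabulary for a single category."""
    n_in = sum(labels)
    n_out = len(labels) - n_in
    vocab = set().union(*docs)
    scores = {}
    for term in vocab:
        a = sum(1 for doc, y in zip(docs, labels) if y and term in doc)
        b = sum(1 for doc, y in zip(docs, labels) if not y and term in doc)
        scores[term] = correlation_coefficient(a, b, n_in - a, n_out - b)
    return scores


def select_features(
    scores: Dict[str, float], size: int, pos_ratio: float
) -> Tuple[List[str], List[str]]:
    """Pick the top positive and top negative terms and return both lists.

    pos_ratio is the (assumed) fraction of the feature set reserved for
    terms indicative of membership; the rest goes to terms indicative of
    non-membership. The final feature set is the union of the two lists.
    """
    n_pos = round(size * pos_ratio)
    n_neg = size - n_pos
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    positive = [t for t in ranked if scores[t] > 0][:n_pos]
    negative = [t for t in reversed(ranked) if scores[t] < 0][:n_neg]
    return positive, negative
```

In this sketch, choosing the ratio empirically would amount to calling `select_features` with several candidate `pos_ratio` values and keeping the one that gives the best classifier performance on held-out data. Setting `pos_ratio=1.0` corresponds to the standard practice of keeping only the terms most indicative of membership.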
Related resources
Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA
With the explosive growth in the amount of information, tools and methods for searching, filtering, and managing resources are in high demand. One of the major problems in text classification is the high dimensionality of the feature space. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...
Meta-Classification using SVM Classifiers for Text Documents
Text categorization is the problem of classifying text documents into a set of predefined classes. In this paper, we investigate three approaches to building a meta-classifier in order to increase classification accuracy. The basic idea is to learn a meta-classifier that optimally selects the best component classifier for each data point. The experimental results show that combining classifiers c...
Exploiting Associations between Class Labels in Multi-label Classification
Multi-label classification has many applications in text categorization, biology, and medical diagnosis, in which multiple class labels can be assigned to each training instance simultaneously. As there are often relationships between the labels, extracting these relationships and taking advantage of them during the training or prediction phases ...
Detection and Localization of Farsi/Arabic Text in Video Images
Video text detection plays an important role in applications such as semantic-based video analysis, text information retrieval, and archiving. In this paper, we propose a Farsi/Arabic text detection approach. First, edges are extracted with an appropriate edge detector, and then artificial corners are extracted using edge cross points. Artificial corner histogram analysis is done for ...
Combining Local Feature Scoring Methods for Text Categorization
Dimensionality reduction is an important process in text categorization. Feature scoring methods are used to achieve this reduction. Features are evaluated, and selection is performed according to a certain threshold. In this paper, we propose combining pairs of high-performing feature scoring methods to enhance text categorization. We analyzed the performance of constructing this combi...